High-throughput sequencing
and genomes

Jelmer Poelstra

MCIC Wooster, OSU

2024-01-25

Intro to sequencing technologies

What does sequencing refer to?

The shorthand sequencing, like in “high-throughput sequencing” in the title of this presentation, generally refers to determining the nucleotide sequence of fragments of DNA.


What about RNA or proteins?

  • RNA is usually reverse transcribed to DNA (cDNA) prior to sequencing, as in nearly all “RNA-seq”.

    Direct RNA sequencing is possible with one of the sequencing technologies we’ll discuss, but this is under development and not yet widely used.


  • Protein sequencing requires different technology altogether, such as mass spectrometry.

Sequencing technologies: overview

  • Sanger sequencing (since 1977)
    Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time

High-throughput sequencing (HTS)
Sequences 10^5 to 10^9, usually randomly selected, DNA fragments (“reads”) at a time. There are two types:


  • Short-read HTS
    • AKA Next-Generation Sequencing (NGS)
    • Produces up to billions of 50-300 bp reads
    • Market dominated by Illumina
    • Since 2005 — technology stable

  • Long-read HTS
    • Reads much longer than in NGS but fewer, less accurate, and more costly per base
    • Two main companies: Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio)
    • Since 2011 — remains under rapid development

Sequencing technology development timeline

Modified after Pereira et al. 2020

Sanger sequencing

Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time.

Sequencing is performed by synthesizing a new DNA strand in part with fluorescently-labeled nucleotides — a different color for each base (A, C, G, T).


The final result is a chromatogram that can be base-called:

https://dnacore.mgh.harvard.edu/new-cgi-bin/site/pages/sequencing_pages/seq_troubleshooting.jsp


The entire human genome (3 Gbp) was sequenced with Sanger technology!

Anyone want to hazard a guess how much this cost, approximately?

Sequencing cost through time

https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

Present-day Sanger applications

With the advent of NGS, Sanger sequencing has become much less common but is not obsolete.


Some present-day applications of Sanger sequencing include:

  • Examining variation among individuals or populations in one or more candidate or marker genes (for population genetics, phylogenetics, functional inferences, etc.)

  • Taxonomic identification of samples

High-throughput sequencing (HTS)

High-throughput sequencing applications

  • Whole-genome assembly
  • Variant analysis (for population genetics/genomics, molecular evolution, GWAS, etc.):

    • Whole-genome “resequencing”

    • Reduced-representation libraries (e.g. RADseq, GBS)

  • RNA-seq (transcriptome analysis)

  • Other functional sequencing methods like methylation sequencing, ChIP-seq, etc.

  • Microbial community characterization

    • Metabarcoding

    • Shotgun metagenomics

Two important variables in high-throughput sequencing:
read lengths & error rates

Read lengths

  • Short-read (Illumina) HTS: 50-300 bp reads

  • Long-read HTS: longer & more variable read lengths (PacBio: 10-50 kbp, ONT: 10-100 kbp)


When are longer reads useful?
  • Genome assembly

  • Haplotype and large structural variant calling

  • Transcript isoform identification

  • Taxonomic identification of single reads (microbial metabarcoding)


When does read length not matter (as much)?
  • SNP variant analysis

  • Read-as-a-tag: the goal is just to know a read’s origin in a reference genome, like in counting applications such as RNA-seq

Error rates

Currently, no sequencing technology is error-free, and several types of errors can occur:

  • Base call errors, e.g. a base that was called as an A may instead be a G.

  • Insertion or deletion (indel) errors

  • When the base calling software is not confident at all, it can also return Ns (= undetermined).

Quality scores in sequence data

When you get sequences from a high-throughput sequencer, base calls have typically already been made. Every base is also accompanied by a quality score (inversely related to the estimated error probability).

Overcoming sequencing errors

  • Sequencing every base multiple times, i.e. having a so-called “depth of coverage” of >1x, makes it possible to infer the correct sequence

  • But overcoming sequencing errors is made more challenging by natural genetic variation among and within (heterozygosity due to diploid genomes) individuals

  • Typical depths of coverage: at least 50-100x for genome assembly; 10-30x for resequencing.
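
The average depth of coverage can be estimated with simple arithmetic. Below is a minimal sketch, assuming that depth is taken as the total number of sequenced bases divided by the genome size; the function names and the example numbers are hypothetical and not from these slides.

```python
import math


def depth_of_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by the genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads needed to reach a target average depth (rounded up)."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Hypothetical example: 400 million reads of 150 bp for a 3 Gbp genome
example_depth = depth_of_coverage(400_000_000, 150, 3_000_000_000)
print(f"Average depth: {example_depth}x")

# Hypothetical example: reads needed for 30x depth of the same genome
example_reads = reads_needed(30, 150, 3_000_000_000)
print(f"Reads needed for 30x: {example_reads}")
```

Note that this is an average: actual depth varies along the genome, so some positions will be covered less than the average suggests.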

Illumina HTS

Illumina (short-read HTS / NGS)

  • 100-300 bp reads with 0.1-0.2% error rates

  • More reads, lower per-base cost, and lower error rates than long-read sequencing.

  • Machines differ in throughput, read length, cost per Gb:

Libraries and library prep

In a sequencing context, a “library” is a collection of nucleic acid fragments ready for sequencing.

In Illumina and other HTS libraries, these fragments number in the millions or billions and are often randomly generated from input such as genomic DNA:

This procedure is called library prep, and is typically done for you by a sequencing facility or company.

Different library prep procedures are used depending on the type of sequencing (WGS, RAD-seq, RNA-seq, etc.) and HTS technology — and some include more specific fragment generation or selection.
We’ll see the specific library prep steps for RNA-seq next week.

Libraries and library prep (cont.)

After library prep (here, for Illumina sequencing), each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:



Adapter components?

After talking about paired-end vs. single-end sequencing and the way Illumina sequencing works, we’ll take a closer look at the individual components of adapters.

Paired-end vs. single-end sequencing

In Illumina sequencing, DNA fragments can be sequenced from both ends as shown below —
this is called “paired-end” (PE) sequencing:


When sequencing is instead single-end (SE), no reverse read is produced:

Paired-end sequencing

  • Paired-end sequencing is a way to effectively increase the read length. (In the resulting sequence files, the two reads in each pair are separate, but they can be matched thanks to shared read IDs.)

  • Earlier, we saw that the maximum Illumina read length is 300 bp. With paired-end sequencing, this becomes “2 x 300 bp” per fragment (and likewise “2 x 150 bp”, etc. for shorter reads).

  • The total size of the biological DNA fragment (without adapters) is often called the insert size:

Insert size variation

Insert size varies, both because the library prep protocol can aim for various sizes and because size selection has limited precision. In some cases, the insert size can be:

  • Shorter than the combined read length, leading to overlapping reads (this can be useful):

  • Shorter than the single read length, leading to “adapter read-through”
    (i.e., the ends of the resulting reads will consist of adapter sequence, which should be removed):
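
The two cases above follow from comparing the insert size to the read length. A minimal sketch of that comparison, assuming both reads in a pair have the same length; the function name and category labels are hypothetical:

```python
def classify_insert(insert_size: int, read_length: int) -> str:
    """Compare an insert size (bp) to the per-read length (bp) of a paired-end run."""
    if insert_size < read_length:
        # Each read runs past the end of the insert and into the adapter
        return "adapter read-through"
    if insert_size < 2 * read_length:
        # Forward and reverse reads cover some of the same bases
        return "overlapping reads"
    return "non-overlapping reads"


def adapter_bases(insert_size: int, read_length: int) -> int:
    """Number of bases at the end of each read that come from the adapter."""
    return max(0, read_length - insert_size)


for size in (100, 250, 500):
    print(size, classify_insert(size, read_length=150), adapter_bases(size, 150))
```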

How Illumina sequencing works

First, library fragments bind to a surface thanks to the adapters, and the DNA templates (the biological sequences) are then PCR-amplified to form “clusters” of identical fragments:

In the diagram above, for illustrative purposes:

  • Only a few nucleotides are shown (1 block = 1 nucleotide) — in reality, fragments are much longer

  • Only two templates => clusters are shown — in reality, there are millions

How Illumina sequencing works (cont.)

Then, sequencing is performed by synthesizing a new strand using fluorescently-labeled bases and taking a picture each time a new nucleotide is incorporated:

How Illumina sequencing works (cont.)

How errors come about in Illumina

  • The different templates within a cluster get out of sync because occasionally:

    • They miss a base incorporation

    • They incorporate two bases at once


  • Base incorporation may also terminate before the end of the template is reached

This error profile is why, for Illumina:

  • There are hard limits on read lengths

  • Base quality scores typically decrease along the read

How Illumina sequencing works: Zooming out

A closer look at the adapter components

Now that you have a better idea of how Illumina sequencing works, let’s briefly revisit the adapters flanking the DNA, and see their different components:


Multiplexing!

Using the indices/barcodes in adapters, up to 96 samples can be multiplexed into a single library.

Long-read HTS

Long-read HTS

The two main long-read HTS technologies work very differently, but they have some commonalities beyond long reads. They:


  • Perform “single-molecule” sequencing (no PCR amplification of library fragments)
  • Therefore require higher quality & quantity of DNA
  • Can detect some base modifications, like methyl groups

Error rates are changing

As a shorthand that was universally true until recently, I mentioned earlier that long-read HTS has a higher error rate than short-read (Illumina) HTS.

However, error rates in one type of PacBio sequencing where individual fragments are sequenced multiple times (“HiFi”) are now lower than in Illumina.

Nanopore sequencing

A single strand of DNA passes through a nanopore while the electrical current is measured.
This current depends on the combination of bases in the pore at that moment:

https://www.genome.gov/genetics-glossary/Nanopore-DNA-Sequencing

See also this short video: https://www.youtube.com/watch?v=RcP85JHLmnI

ONT (Nanopore) sequencers

Under development!

ONT constantly releases new flow cells with updated technology, which have led to large decreases in error rates over the past decade — and even over the past two or so years.

There is also a lot of development in ONT base-calling software so it is useful to receive and keep pre-basecall files: re-basecalling a few years later with updated software can make a difference.

ONT vs. PacBio

Advantages of ONT:

  • Low capital cost, portability (in-the-field sequencing!)

  • Read length not inherently limited, some extremely long reads

  • Lower cost per base

  • Can sequence RNA directly (but still under development)


Disadvantages of ONT:

  • Higher error rates

  • Some systematic errors (e.g. homopolymers)

(Reference) Genomes

Genomes

HTS methods mainly serve genomics and transcriptomics research, so genomes loom large in HTS. Specifically, most HTS applications either require a “reference genome” or involve its production.


What exactly does “reference genome” refer to? We’ll discuss three components to this phrase:

  • Assembly
    It includes a representation of most of the genome DNA sequence: the genome assembly
  • Annotation
    It (preferably) includes an “annotation” that provides the locations of genes and other genomic features, as well as functional information on these features
  • Taxonomic identity
    Typically considered at the species level, so then it should involve the focal species. But:

    • If necessary, it is often possible to work with reference genomes of closely related species

    • Conversely, multiple reference genomes may exist, e.g. for different subspecies/populations

Genome size variation

https://en.wikipedia.org/wiki/Genome_size

Genome structure

https://en.wikipedia.org/wiki/Karyotype





Key features:

  • Number of distinct chromosomes

  • Ploidy

Growth of genome databases

Konkel & Slot 2023

Genome assemblies

  • With increasing usage & quality of long-read HTS, we are generating better assemblies

  • For chromosome-level assemblies, i.e. with one contiguous sequence for each chromosome, technologies in addition to regular sequencing are often needed (e.g. Hi-C, optical mapping)

  • Many assemblies are not “chromosome-level”, but consist of –often 1000s of– contigs and scaffolds.

  • Even chromosome-level assemblies are not 100% complete (and contain “unplaced” scaffolds)


Question: Contigs vs. scaffolds?

Contigs are contiguous, known stretches of DNA created by the assembly process, basically by overlapping reads.

Often, the order and orientation of two or more contigs is known, but there is a gap of unknown size between them. Such contigs are connected into scaffolds with a stretch of Ns in between.
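
To make the relation between scaffolds and contigs concrete, here is a minimal sketch that recovers the contigs from a scaffold sequence by splitting it at runs of Ns. The function name and the toy sequence are hypothetical.

```python
import re


def split_scaffold(scaffold_seq: str) -> list[str]:
    """Split a scaffold into its contigs at every run of one or more Ns."""
    # Empty strings (from leading or trailing Ns) are dropped
    return [contig for contig in re.split(r"[Nn]+", scaffold_seq) if contig]


# Toy scaffold: three contigs separated by two gaps
toy_scaffold = "ACGTNNNNNGGCCNNTT"
contigs = split_scaffold(toy_scaffold)
print(contigs)
print("Number of gaps:", len(contigs) - 1)
```

Keep in mind that the length of a run of Ns does not necessarily reflect the true gap size, since that size is often unknown.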

Genome annotations

  • Annotating a genome consists of two main steps:

    • Structural annotation
      The identification of genes and other genomic features within the genome sequence

    • Functional annotation
      Giving names & assigning functions to (mostly) genes

  • Genome annotation heavily relies on information from other organisms’ genomes, lifting over annotations based on the concept of sequence homology.


How is this data stored?

Both genome assemblies and annotations are typically saved in a single text file each — more on that soon.

Appendix: Sequence data files

Overview

All common genetic/genomic data files are plain-text, meaning that they can be opened by any text editor. However, they are often compressed to save space. The main types are:

  • FASTA
    Simple sequence files, where each entry contains a header and a DNA/AA sequence.
    Versatile: anything from genome assemblies, proteomes, and single sequence fragments to alignments can be in this format.

  • FASTQ
    The standard format for HTS reads — contains a quality score for each nucleotide.

  • SAM/BAM
    An alignment format for HTS reads


  • GTF/GFF
    Tables (tab-delimited) with information such as genomic coordinates on “genomic features” such as genes and exons. The files contain reference genome annotations.

FASTA files

FASTA files contain one or more (sometimes called multi-FASTA) DNA or amino acid sequences, with no limits on the number of sequences or the sequence lengths.


As mentioned, they are versatile, and are the standard format for:

  • Genome assembly sequences

  • Transcriptomes and proteomes (all of an organism’s transcripts & amino acid sequences, resp.)

  • Sequence downloads from NCBI such as a single gene/protein or other GenBank entry

  • Sequence alignments (but not from HTS reads)

FASTA files (cont.)

The following example FASTA file contains two entries:

>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG

Each entry contains a header and the sequence itself, and:

  • Header lines start with a > and are otherwise “free form” but usually provide an identifier (and sometimes metadata) for the sequence
  • A single sequence is often not on a single line, but spread across multiple lines with a fixed width
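
The header-and-sequence structure described above can be read with a few lines of Python. This is only a sketch under simple assumptions (the whole file fits in memory, and the first word of each header is a unique ID); the function name and the toy sequences are hypothetical. In practice, an established library is usually a better choice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return a dict of sequence ID -> sequence from FASTA-formatted text."""
    chunks_by_id = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is taken as the first word after '>'; the rest is description
            seq_id = line[1:].split()[0]
            chunks_by_id[seq_id] = []
        elif seq_id is not None:
            # A sequence can be spread across multiple lines
            chunks_by_id[seq_id].append(line)
    return {key: "".join(chunks) for key, chunks in chunks_by_id.items()}


toy_fasta = """>seq1 Optional description
ATTCATTAAAG
CAGTTTATTGG
>seq2
ATTCATTAAATG
"""
sequences = parse_fasta(toy_fasta)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```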

FASTA file name extensions are variable:

  • Generic extensions are .fasta and .fa

  • Also used are extensions that explicitly indicate whether sequences are nucleotide (.fna) or amino acids (.faa)

FASTQ

FASTQ is the standard format for HTS reads.
Each read forms one FASTQ entry and is represented by four lines, which contain, respectively:

  1. A header that starts with @ and e.g. uniquely identifies the read
  2. The sequence itself
  3. A + (plus sign)
  4. One-character quality scores for each base (hence FASTQ as in “Q” for “quality”)
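
As an illustration of the four-line structure, the sketch below parses FASTQ text into (ID, sequence, quality string) tuples. It assumes strictly four lines per read and no blank lines within records; the function name and the made-up read are hypothetical.

```python
def parse_fastq(text: str) -> list[tuple[str, str, str]]:
    """Return (read ID, sequence, quality string) for each 4-line FASTQ entry."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("Expected a multiple of 4 lines")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, plus, qualities = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ entry starting at line {i + 1}")
        if len(sequence) != len(qualities):
            raise ValueError("Sequence and quality string differ in length")
        records.append((header[1:], sequence, qualities))
    return records


# A made-up read, for illustration only
toy_fastq = "@read1\nGATTACA\n+\nIIIII5+\n"
for read_id, sequence, qualities in parse_fastq(toy_fastq):
    print(read_id, sequence, qualities)
```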

FASTQ quality scores

The quality scores we saw in the read on the previous slide represent an estimate of the error probability of the base call.

Specifically, they correspond to a numeric “Phred” quality score (Q), which is a function of the estimated probability that a base call is erroneous (P):

Q = -10 * log10(P)


For some specific probabilities and their rough qualitative interpretation for Illumina data:

Phred quality score   Error probability   Rough interpretation
10                    1 in 10             terrible
20                    1 in 100            bad
30                    1 in 1,000          good
40                    1 in 10,000         excellent
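
The formula above can be applied in both directions. A minimal sketch, with hypothetical function names:

```python
import math


def prob_to_phred(error_prob: float) -> float:
    """Q = -10 * log10(P)"""
    return -10 * math.log10(error_prob)


def phred_to_prob(phred: float) -> float:
    """Inverse of the formula above: P = 10^(-Q / 10)"""
    return 10 ** (-phred / 10)


for q in (10, 20, 30, 40):
    print(f"Q{q}: estimated error probability {phred_to_prob(q):.4f}")
```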

FASTQ quality scores (cont.)

This numeric quality score is represented in FASTQ files not by the number itself, but by a corresponding “ASCII character”.

This allows for a single-character representation of each possible score — as a consequence, each quality score character can conveniently correspond to (& line up with) a base character in the read.

Phred quality score   Error probability   ASCII character
10                    1 in 10             +
20                    1 in 100            5
30                    1 in 1,000          ?
40                    1 in 10,000         I

A rule of thumb

In practice, you almost never have to manually check the quality scores of bases in FASTQ files, but if you do, a rule of thumb is that letter characters are good (Phred of 32 and up).
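
The characters in the table above are consistent with the common “Phred+33” encoding, in which the character's ASCII code equals the quality score plus 33. The sketch below assumes that offset (older data may use a different one); the function names are hypothetical.

```python
PHRED_OFFSET = 33  # Assumption: "Phred+33" encoding


def decode_qualities(quality_string: str) -> list[int]:
    """Convert a FASTQ quality string to a list of numeric Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality_string]


def encode_quality(phred: int) -> str:
    """Convert one numeric Phred score to its ASCII character."""
    return chr(phred + PHRED_OFFSET)


print(decode_qualities("+5?I"))  # The four characters from the table above
print(encode_quality(32))        # The first uppercase letter
```

Under this encoding, uppercase letters start at a score of 32, which is where the rule of thumb about letter characters comes from.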

FASTQ (cont.)

FASTQ files have no size limit, so you may receive a single file per sample, although:

  • With paired-end (PE) sequencing, forward and reverse reads are split into two files:
    forward reads contain R1 and reverse reads contain R2 in the file name.

  • If sequencing was done on multiple lanes, you get one (SE) or two (PE) files per lane per sample.


FASTQ files have the extension .fastq or .fq (but are commonly compressed, leading to .fastq.gz etc.). All in all, having paired-end FASTQ files for 2 samples could look like this:

# A listing of (unusually simple) file names:
sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
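
When working with paired-end data, forward and reverse files need to be matched per sample. The sketch below does this for simple file names like `sample1_R1.fastq.gz`. The naming pattern is an assumption: real file names often contain additional elements (e.g. lane numbers), and the regular expression would need to be adapted to those. The function and variable names are hypothetical.

```python
import re

# Assumed naming pattern: <sample>_R1 or <sample>_R2, then .fastq or .fq, optionally .gz
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_(?P<read>R[12])\.(?:fastq|fq)(?:\.gz)?$")


def pair_fastq_files(file_names: list[str]) -> dict[str, dict[str, str]]:
    """Group FASTQ file names by sample, keyed by read direction (R1/R2)."""
    pairs = {}
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # Skip files that don't follow the assumed pattern
        pairs.setdefault(match["sample"], {})[match["read"]] = name
    return pairs


toy_files = [
    "sample1_R1.fastq.gz",
    "sample1_R2.fastq.gz",
    "sample2_R1.fastq.gz",
    "sample2_R2.fastq.gz",
]
for sample, reads in pair_fastq_files(toy_files).items():
    print(sample, reads.get("R1"), reads.get("R2"))
```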

GTF/GFF

The GTF and GFF formats are tab-delimited tabular files that contain genome annotations, with:

  • One row for each annotated “genomic feature” (gene, exon, etc.)

  • One column for each piece of information about a feature, like its genomic coordinates

See the sample below, with an added header line (not normally present) with column names:

seqname     source  feature start   end     score  strand  frame    attributes
NC_000001   RefSeq  gene    11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true"; 
NC_000001   RefSeq  exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; 

Some details on the more important/interesting columns:

  • seqname — Name of the chromosome, scaffold, or contig
  • feature — Name of the feature type, e.g. “gene”, “exon”, “intron”, “CDS”
  • start & end — Start & end position of the feature
  • strand — Whether the feature is on the + (forward) or - (reverse) strand
  • attributes — A semicolon-separated list of tag-value pairs with additional information
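
To show how these columns can be accessed programmatically, here is a minimal sketch that parses one GTF line into a dictionary. It assumes tab-separated columns and naively splits the attributes on semicolons, which would break if a quoted value itself contained a semicolon. The function name is hypothetical, and the example line is a shortened version of the exon line above.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_gtf_line(line: str) -> dict:
    """Parse a single tab-delimited GTF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Tags can occur more than once (e.g. db_xref), so values are kept in lists
    attributes = {}
    for pair in record["attributes"].split(";"):
        pair = pair.strip()
        if not pair:
            continue
        tag, _, value = pair.partition(" ")
        attributes.setdefault(tag, []).append(value.strip('"'))
    record["attributes"] = attributes
    return record


toy_line = "\t".join([
    "NC_000001", "RefSeq", "exon", "11874", "12227", ".", "+", ".",
    'gene_id "DDX11L1"; transcript_id "NR_046018.2"; gene "DDX11L1"; pseudo "true";',
])
exon = parse_gtf_line(toy_line)
# Coordinates are 1-based and inclusive, hence the + 1
print(exon["feature"], exon["attributes"]["gene_id"], exon["end"] - exon["start"] + 1)
```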

SAM/BAM

Using specialized bioinformatics tools, you can align HTS reads (in FASTQ files) to a reference genome assembly (in a FASTA file).

The resulting alignments are stored in the SAM (uncompressed) / BAM (compressed) format.


SAM/BAM are tabular files with one line per alignment, each of which includes:

  • The position in the genome that the read aligned to

  • A mapping score based on the length of the alignment and the number of mismatches

  • The sequence of the aligned read itself
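
As an illustration, the sketch below splits one SAM alignment line into the 11 mandatory fields defined by the SAM specification. The function name and the example alignment are made up; header lines (which start with `@`) are not handled, and BAM files would first need to be decompressed with a dedicated tool.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> dict:
    """Parse one tab-delimited SAM alignment line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_FIELDS):
        raise ValueError("A SAM alignment line needs at least 11 fields")
    record = {}
    for name, value in zip(SAM_FIELDS, fields):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    # Any further columns are optional tags
    record["TAGS"] = fields[len(SAM_FIELDS):]
    return record


# A made-up alignment, for illustration only
toy_sam_line = "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII"
alignment = parse_sam_line(toy_sam_line)
print(alignment["RNAME"], alignment["POS"], alignment["MAPQ"], alignment["SEQ"])
```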


File conversions

  • FASTQ files can be converted to FASTA files (losing quality information) but not vice versa

  • SAM/BAM files can be converted to FASTQ files (losing alignment information) but not vice versa

  • Proteome FASTA files can be produced from the combination of a FASTA genome assembly and a GFF/GTF genome annotation
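
The first conversion above, FASTQ to FASTA, only requires keeping the header and sequence lines of each read. A minimal sketch, assuming strictly four lines per read; the function name and the made-up read are hypothetical:

```python
def fastq_to_fasta(fastq_text: str) -> str:
    """Convert FASTQ-formatted text to FASTA, dropping the quality scores."""
    lines = fastq_text.splitlines()
    fasta_lines = []
    # Step through the entries, 4 lines at a time
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        fasta_lines.append(">" + header[1:])  # Replace the leading '@' with '>'
        fasta_lines.append(sequence)
    if not fasta_lines:
        return ""
    return "\n".join(fasta_lines) + "\n"


# A made-up read, for illustration only
toy_read = "@read1 extra info\nGATTACA\n+\nIIIIIII\n"
print(fastq_to_fasta(toy_read), end="")
```

The reverse conversion is not possible because the quality scores are simply absent from FASTA files.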